Abstract

Many regularization schemes for high-dimensional regression have been put forward. Most require
the choice of a tuning parameter, using model selection criteria or cross-validation schemes. We show
that non-negative or sign-constrained least squares is a very simple and effective regularization
technique for a certain class of high-dimensional regression problems. The sign constraint has to be
derived via prior knowledge or an initial estimator but no further tuning or cross-validation is necessary.
The success depends on conditions that are easy to check in practice. A sufficient condition for our
results is that most variables with the same sign constraint are positively correlated. For a sparse
optimal predictor, a non-asymptotic bound on the $\ell_1$-error of the regression coefficients is then proven.
Without using any further regularization, the regression vector can be estimated consistently as long as
$\log(p)\, s/n \to 0$ for $n \to \infty$, where $s$ is the sparsity of the optimal regression vector, $p$ the number of
variables and $n$ the sample size. Network tomography is shown to be an application where the necessary
conditions for success of non-negative least squares are naturally fulfilled and empirical results confirm
the effectiveness of the sign constraint for sparse recovery.



1 Introduction 



High-dimensional regression problems are characterized by a large number of predictor variables in relation
to sample size. Regularization (in a broad sense) is of critical importance for high-dimensional problems
and much attention has been paid to various schemes and their properties in recent years, including the
Ridge estimator (Hoerl and Kennard 1970), the non-negative Garrote (Breiman 1995), the Lasso (Tibshirani
1996) and various variations of the latter, including the group Lasso (Yuan and Lin 2006) and the adaptive
Lasso (Zou 2006). Datasets with very low signal-to-noise ratio offer similar challenges to high-dimensional
problems even if the notional sample size is quite high.

Sign-constraints on the regression coefficients are a simpler regularization and were first advocated
by I.J. Good, as covered in the book by Lawson and Hanson (1995). There is a wide range of problems where
the sign of the regression coefficients can either be estimated by an initial estimator or where it is known a
priori, such as in image processing and spectral analysis (Waterman 1977; Bellavia et al. 2006; Donoho et al.
1992; Chen and Plemmons 2009). Sign-constraints have also been implemented for matrix factorizations,
specifically the non-negative matrix factorization (Lee et al. 1999; Lee and Seung 2001; Ding et al. 2010),
and non-negative least squares regression can be a useful tool for this factorization (Kim and Park 2007).



We study the performance of non-negative least squares type problems under a so-called Positive Eigenvalue
Condition, which can be checked for any given dataset by solving a quadratic programming problem. A
sufficient condition uses only the minimum of all entries in the design matrix. It is shown that non-negative
(or, in general, sign-constrained) least squares is a surprisingly effective regularization technique for
high-dimensional regression problems under these conditions. If the Positive Eigenvalue Condition is not fulfilled,
the sign constraint is still a good ingredient in a regularization framework. The non-negative Garrote
(Breiman 1995), for example, makes use of a sign-constraint, where the signs are derived from an initial
estimator, as does the positive Lasso (Efron et al. 2004).



The data are assumed to be given by an $n \times 1$-vector of real-valued observations $Y$ and an $n \times p$-dimensional
matrix $X$, where column $k$ of $X$ contains all $n$ samples of the $k$-th predictor variable for $k = 1, \ldots, p$. The
non-negative least squares (NNLS) regression estimator is defined as

$$\hat\beta := \operatorname{argmin}_{\beta} \|Y - X\beta\|_2^2 \quad \text{such that} \quad \min_k \beta_k \ge 0. \qquad (1)$$



We will work with a positivity constraint without loss of generality since variables that are constrained
to be negative can be replaced by their negative counterpart and the problem can thus always be framed
as a non-negative least squares optimisation. Problem (1) is a convex optimization problem and can be
solved with general quadratic programming solvers, including active set (Lawson and Hanson
1995), iterative (Kim et al. 2006) and interior-point approaches (Bellavia et al. 2006). A tailor-made fast
approximate algorithm based on random projections has recently been proposed in Boutsidis and Drineas
(2009). The recent manuscript Slawski et al. (2011) contains independent work on the behaviour of NNLS
in high dimensions. Using the same Positive Eigenvalue Condition (which is called the self-regularizing design
condition there), a bound on the prediction error of NNLS and a sparse recovery property after hard thresholding
are shown in Slawski et al. (2011). Our main focus is on sparse recovery in the $\ell_1$-sense. The bounds
on prediction error are also of a different nature since the assumptions are different. We make use of the
so-called compatibility condition, which appears in most sparse recovery results in the $\ell_1$-norm penalized
estimation literature (Van De Geer and Bühlmann 2009), and derive, with the help of this condition, tight
non-asymptotic bounds on the prediction error.
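As a rough illustration of how problem (1) can be solved without a dedicated quadratic programming package, the following sketch uses cyclic coordinate descent on the normal equations. This is not one of the algorithms cited above; the function name `nnls_coordinate_descent` and the fixed number of sweeps are assumptions made for this example, and in practice a library routine such as `scipy.optimize.nnls` would normally be preferred.

```python
def nnls_coordinate_descent(X, y, n_sweeps=200):
    """Approximately minimise ||y - X b||_2^2 subject to b >= 0.

    X is a list of n rows with p entries each, y a list of n observations.
    Illustrative sketch only: a fixed number of sweeps, no convergence test.
    """
    n, p = len(X), len(X[0])
    # Gram matrix G = X^T X and vector h = X^T y
    G = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)] for a in range(p)]
    h = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for k in range(p):
            if G[k][k] == 0.0:
                continue  # all-zero column: its coefficient stays at zero
            # Partial derivative of 0.5 * ||y - X b||^2 with respect to b_k
            grad_k = sum(G[k][j] * beta[j] for j in range(p)) - h[k]
            # Exact minimisation along coordinate k, then clipping at zero
            beta[k] = max(0.0, beta[k] - grad_k / G[k][k])
    return beta


# Intercept column plus an increasing covariate; the response decreases,
# so unconstrained least squares would fit intercept 4 and slope -1.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [3.0, 2.0, 1.0]
beta_hat = nnls_coordinate_descent(X, y)
print(beta_hat)  # -> [2.0, 0.0]
```

With the positivity constraint the slope is set to zero and the intercept becomes the mean of the observations, which shows that the constraint alone can already change the fit substantially.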

Note that the non-negative least squares estimator (1) does not require the choice of a tuning parameter
beyond choosing the sign of the coefficients. Imposing a sign-constraint might seem like a very weak
regularization but it will be shown that the estimator is remarkably different from the un-regularized least
squares estimator. It can cope with high-dimensional problems, where the number of predictor variables
vastly exceeds sample size. It will be shown to be a consistent estimator as long as the underlying optimal
prediction is sufficiently sparse (i.e. using only a small subset of all predictor variables) and the so-called
Positive Eigenvalue Condition is fulfilled.

The manuscript is organized as follows. The notation and the two main assumptions, the Compatibility
Condition and the Positive Eigenvalue Condition, are introduced in Section 2. Our main result, an $\ell_1$-bound on the difference
between the NNLS estimator and the optimal regression coefficients, is shown in Section 3, along with a
bound on the prediction error.



2 Notation and Assumptions 

We assume that the $n$ samples $Y \in \mathbb{R}^n$ are drawn from $X\beta^* + \varepsilon$ for some $p$-dimensional vector $\beta^*$ with
$\min_k \beta^*_k \ge 0$ and $\varepsilon \sim \mathcal{N}(0, \sigma^2)$ for some $\sigma > 0$. Let $S$ be the set of non-zero entries of the optimal solution,
$S := \{k : \beta^*_k \neq 0\}$, and $N = S^c$ be the complement of $S$. We could also let $\beta^*$ be the best approximation to the
data-generating model under positivity constraints but will refrain from doing so for notational simplicity.
We assume that the columns of $X$ are standardized to $\ell_2$-norm of $n$. Despite not necessarily assuming that
the columns are mean-centered, we call $\hat\Sigma = n^{-1} X^T X$ the covariance matrix throughout.

We make two major assumptions for the main result, one about sparse eigenvalues and another about 
the positive eigenvalue between predictor variables. 



2.1 Compatibility Condition 



There has been much recent work on the properties of the Lasso (Tibshirani 1996). Many similar conditions
for success of the Lasso penalization schemes have been derived (for example Zhang and Huang 2008;
Meinshausen and Yu 2009; Wainwright 2009; Bunea et al. 2007, 2006; Van De Geer 2008; Bickel et al.
2009). A good overview of all conditions and their relations is given in Van De Geer and Bühlmann (2009).
The weakest condition is based on the notion of $(L, S)$-restricted $\ell_1$-eigenvalues.

The $(L, S)$-restricted $\ell_1$-eigenvalue of a matrix $A$ is defined as

$$\phi^2_{compatible}(L, S, A) := \min\Big\{ s\, \frac{\beta^T A \beta}{\|\beta\|_1^2} : \beta \in \mathcal{R}(L, S) \Big\},$$

where $\mathcal{R}(L, S) = \{\beta : \|\beta_N\|_1 \le L \|\beta_S\|_1\}$ and $s = |S|$.

A lower bound on this restricted eigenvalue is necessary for success of the Lasso, either in a prediction
loss or coefficient recovery sense, and was called the compatibility condition in Van De Geer and Bühlmann
(2009). It was shown to be weaker than all similar conditions such as the Restricted Isometry Property
(Candes and Tao 2007).



We make the following assumption.

Assumption 1 (Compatibility Condition). There exists some $\phi > 0$ such that the $(L, S)$-restricted
$\ell_1$-eigenvalue satisfies $\phi^2_{compatible}(L, S, \hat\Sigma) \ge \phi$.

The value of $L$ will be specified in Theorem 1.

Remark 1. The assumption is formulated for the empirical covariance matrix $\hat\Sigma$ but can also easily be
reformulated on the population covariance matrix $\Sigma$ for random design. Assume that the maximal difference
between the population and empirical covariance matrix is bounded by $\delta > 0$, that is $\|\hat\Sigma - \Sigma\|_\infty \le \delta$. This
assumption is fulfilled with high probability for many data sets with larger sample size. If the predictors have
for example a multivariate normal distribution (which will not be assumed elsewhere), then the condition is
fulfilled with probability $1 - 2\exp(-t)$ for $\delta \ge \sqrt{u} + u$ with $u = (4t + 8\log(p))/n$, see (10.1) in
Van De Geer and Bühlmann (2009). If $\delta < \phi/(16 (L+1)^2 s)$, then $\phi^2_{compatible}(L, S, \Sigma) \ge \phi$ implies
$\phi^2_{compatible}(L, S, \hat\Sigma) \ge \phi/2$. The proof follows from the inequality

$$\phi_{compatible}(L, S, \hat\Sigma) \ge \phi_{compatible}(L, S, \Sigma) - (L+1)\sqrt{\delta s}$$

in Corollary 10.1 in Van De Geer and Bühlmann (2009). The Compatibility Condition could thus be imposed on the population
covariance matrix instead of the empirical covariance matrix.



2.2 Positive eigenvalue condition 

The following Positive Eigenvalue Condition is the main assumption necessary to show success of
non-negative least squares.

The positively constrained minimal $\ell_1$-eigenvalue of a matrix $A$ is defined as

$$\phi^2_{pos}(A) := \min\Big\{ \frac{\beta^T A \beta}{\|\beta\|_1^2} : \min_k \beta_k \ge 0 \Big\}.$$

A lower bound on this restricted eigenvalue will be a sufficient condition for sparse recovery success of NNLS.

Assumption 2 (Positive Eigenvalue Condition). There exists some $\nu > 0$ such that $\phi^2_{pos}(\hat\Sigma) \ge \nu$.

A lower bound on this eigenvalue seems to be a much stricter condition than the Compatibility Condition. 
However, the latter allows for positive and negative regression coefficients, while the Positive Eigenvalue 
Condition is restricted to positive coefficients. There are thus some immediate examples where it is fulfilled, 
which we discuss below. 



Example I: strictly positive covariance matrix. The Positive Eigenvalue Condition is fulfilled if
$\min_{i,j} \hat\Sigma_{ij} \ge \nu > 0$, that is all entries in the covariance matrix are strictly positive. Again, this condition could
also be formulated for the population covariance matrix, using a bound on $\|\hat\Sigma - \Sigma\|_\infty$.
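The reason Example I works can be written out in one line. The following short derivation is a sketch that uses only the definition of the positively constrained eigenvalue and the assumption that all entries of the covariance matrix are at least $\nu$:

```latex
\beta^T \hat\Sigma \beta
  = \sum_{i,j} \beta_i \beta_j \hat\Sigma_{ij}
  \ge \nu \sum_{i,j} \beta_i \beta_j
  = \nu \Big( \sum_k \beta_k \Big)^2
  = \nu \|\beta\|_1^2
  \qquad \text{for all } \beta \text{ with } \min_k \beta_k \ge 0 .
```

The inequality uses $\beta_i \beta_j \ge 0$, which is exactly where the sign constraint enters. Dividing by $\|\beta\|_1^2$ and minimizing over all non-negative $\beta \neq 0$ gives a positively constrained minimal $\ell_1$-eigenvalue of at least $\nu$.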

We also remark on the case of general sign-constraints (some variables constrained to be positive, some 
negative). The condition applies then to the dataset where all variables with a negativity constraint have 
been replaced with their negative counterparts. The constraint on the original covariance matrix is thus that 
it forms two blocks. The variables in the first block are the variables with a positivity constraint and the 
second block is formed by all variables with a negativity constraint. Correlations are required to be positive 
within a block and negative between blocks. 
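The replacement of negatively constrained variables by their negative counterparts can be sketched as follows. The helper names `to_nonnegative_problem`, `to_original_coefficients` and `predict` are hypothetical and only serve this illustration; the sketch shows that a non-negative coefficient vector for the flipped design gives the same fitted values as the sign-constrained vector for the original design.

```python
def to_nonnegative_problem(X, signs):
    """Flip the columns of X whose coefficient is constrained to be negative.

    signs[k] is +1 for a positivity constraint and -1 for a negativity constraint.
    """
    return [[signs[k] * row[k] for k in range(len(signs))] for row in X]


def to_original_coefficients(beta_tilde, signs):
    """Map a non-negative solution of the flipped problem back to the original signs."""
    return [signs[k] * beta_tilde[k] for k in range(len(signs))]


def predict(X, beta):
    """Fitted values X beta for a design given as a list of rows."""
    return [sum(x * b for x, b in zip(row, beta)) for row in X]


X = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]
signs = [+1, -1]            # first coefficient >= 0, second coefficient <= 0
beta_tilde = [3.0, 0.5]     # any non-negative vector for the flipped design
X_tilde = to_nonnegative_problem(X, signs)
beta = to_original_coefficients(beta_tilde, signs)

fitted_flipped = predict(X_tilde, beta_tilde)
fitted_original = predict(X, beta)
```

Since the fitted values agree for every non-negative `beta_tilde`, minimizing the residual sum of squares over non-negative coefficients of the flipped design is equivalent to the sign-constrained problem on the original design.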

A generalization of Example I is the following.

Example II: only few negative entries. Let $\mathcal{A} := \{i : \hat\Sigma_{ij} < 0 \text{ for some } 1 \le j \le p\}$ be the minimal set
such that $\hat\Sigma_{ij} < 0$ implies $\{i, j\} \subseteq \mathcal{A}$ for all $1 \le i, j \le p$. The Positive Eigenvalue Condition is fulfilled if
both of the conditions below are fulfilled for some $\nu > 0$.

1. All entries of the covariance matrix are strictly positive on $\mathcal{A}^c$, that is $\hat\Sigma_{ij} \ge 2\nu$ if $\{i, j\} \subseteq \mathcal{A}^c$ for all
$1 \le i, j \le p$.

2. A restricted eigenvalue condition holds on the set $\mathcal{A}$, i.e.

$$\min\Big\{ \frac{\beta^T \hat\Sigma \beta}{\|\beta\|_1^2} : \beta_k = 0 \text{ for all } k \in \mathcal{A}^c \Big\} \ge 2\nu.$$

If the set $\mathcal{A}$ is very small, in particular much smaller than $n$, the latter restricted $\ell_1$-eigenvalue condition
is in general not very restrictive. The important criterion is thus whether the set $\mathcal{A}$ is small compared to
the sample size.



Example III: block matrix. For a $p \times p$-matrix $A$ and a set $K \subseteq \{1, \ldots, p\}$, let $A_{KK}$ be the $|K| \times |K|$-submatrix
formed by all elements in set $K$. Suppose

1. Entries of the covariance matrix can be negative but fulfil $\hat\Sigma_{ij} \ge -\rho/p^2$ for all $1 \le i, j \le p$ and some
$\rho > 0$.

2. The set of variables $\{1, \ldots, p\}$ can be partitioned into $B \ge 1$ blocks $B_j \subseteq \{1, \ldots, p\}$ such that
$\phi^2_{pos}(\hat\Sigma_{B_j B_j}) \ge (\nu + \rho) B$ for all $j = 1, \ldots, B$.

A more specific example is thus: all entries in $\hat\Sigma$ are larger than $-\rho/p^2$ for some $\rho > 0$ and $\hat\Sigma_{ij} \ge (\nu + \rho) B$
if both $i, j$ are within the same block.

The Positive Eigenvalue Condition is then fulfilled with parameter $\nu > 0$.

The positive aspect of the condition is that it is very easy to check in practice whether it applies (at least 
approximately) and whether one would thus expect the bounds shown below to apply to a given dataset. 
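A quick practical check along the lines of Example I is to look at the smallest entry of the empirical covariance matrix. The sketch below is only the sufficient check, not the quadratic program that evaluates the condition exactly. The function names are hypothetical, and the columns are rescaled so that the diagonal of the covariance matrix equals one, which is an assumption about how the standardization is meant.

```python
import math


def empirical_gram(X):
    """Return n^{-1} Z^T Z, where Z is X with every column rescaled to squared norm n.

    Assumes that no column of X is identically zero.
    """
    n, p = len(X), len(X[0])
    scale = [math.sqrt(n / sum(X[i][k] ** 2 for i in range(n))) for k in range(p)]
    Z = [[X[i][k] * scale[k] for k in range(p)] for i in range(n)]
    return [[sum(Z[i][a] * Z[i][b] for i in range(n)) / n for b in range(p)]
            for a in range(p)]


def positive_eigenvalue_lower_bound(X):
    """Smallest entry of the empirical covariance matrix.

    A strictly positive value certifies the Positive Eigenvalue Condition with
    that value (Example I); a value <= 0 means this simple check is inconclusive.
    """
    return min(min(row) for row in empirical_gram(X))


nu_positive = positive_eigenvalue_lower_bound([[1.0, 1.0], [1.0, 0.0]])
nu_inconclusive = positive_eigenvalue_lower_bound([[1.0, 1.0], [1.0, -1.0]])
```

In the first design the two columns are positively correlated and the check returns a strictly positive bound. In the second design the columns are orthogonal, the smallest entry is zero, and the condition would have to be verified by other means.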



3 Main Results 

It will be shown that non-negative least squares leads to a good recovery of the optimal sparse regression
vector for high-dimensional data. We study the $\ell_1$-error in the regression vector, which also yields a bound
on the $\ell_2$-error and prediction loss.

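Theorem 1. Assume that the Positive Eigenvalue Condition holds with $\nu > 0$. Choose any $0 < \eta < 1/3$.
Assume that the compatibility condition holds with $\phi > 0$ for $L = 4\nu^{-1}$. Setting

$$K_{p,\eta} := \sqrt{2\log(p/\eta)}$$

and assuming $\min_{k \in S} \beta^*_k > K_{p,\eta}\, \sigma/\sqrt{n\phi}$, it then holds with probability at least $1 - \eta$ that

$$\|\hat\beta - \beta^*\|_1 \le K_{p,\eta}\, \sigma \Big( \frac{5}{\nu} + \frac{4}{\sqrt{\phi}} \Big) \frac{s}{\sqrt{n}}. \qquad (2)$$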



A proof is in the appendix. 

The result might be surprising since it implies that non-negative least squares succeeds in recovering
the regression coefficients in an $\ell_1$-sense if $\log(p)\, s/\sqrt{n} \to 0$ for $n \to \infty$, a scaling that requires for general
design a lot more regularization in the form of Lasso penalties (or similar).

The result does not imply exact sign recovery in the sense that the set of non-zero coefficients equals exactly
the set $S$ (and indeed this will in general not be the case), but it implies that the $s$ largest coefficients
correspond to the variables in the set $S$.
Figure 1: Left: A network with three internal nodes and three leaf nodes. The (unobservable) losses at
the internal nodes are (10,10,0), meaning that the first two nodes lead to a loss rate of 10 and the third 
node is not leading to any losses. The observations of the loss rates at the leaf nodes are then (8,3,9). Using 
the observations at the leaf nodes and knowledge of the topology, NNLS can correctly identify the two first 
nodes as responsible for the losses. Right: A network with 78 internal nodes and 22 leaf nodes. Two of 
the internal nodes have a positive loss (marked with a dot) and the observations at the leaf nodes are again 
sufficient to pinpoint the (unknown) location of the two nodes using NNLS estimation. 



Corollary 1. Under the same conditions as Theorem 1 and the stronger assumption that the minimum
over all non-zero coefficients is bounded from below by $\min_{k \in S} \beta^*_k > 2 K_{p,\eta}\, \sigma (5/\nu + 4/\sqrt{\phi})\, s/\sqrt{n}$, it holds with
probability at least $1 - \eta$ that the indices of the $s$ largest absolute coefficients in $\hat\beta$ are identical to the set $S$.

This follows immediately from Theorem 1 since the $\ell_1$-bound on the difference between $\hat\beta$ and $\beta^*$ implies
the same bound in the supremum-norm.

The bound in Theorem 1 also implies a bound on the prediction error.

Theorem 2. Under the same conditions as Theorem 1, with probability at least $1 - \eta$ for any $0 < \eta < 1/3$,

$$\|X(\hat\beta - \hat\beta^{oracle})\|_2^2 \le 2 K_{p,\eta}^2\, \sigma^2 \Big( \frac{5}{\nu} + \frac{2}{\sqrt{\phi}} \Big) s.$$

A proof is given in the appendix, where the oracle estimator $\hat\beta^{oracle}$ is defined in (4). The mean squared error
introduced by using NNLS instead of the oracle estimator is thus proportional to $\log(p)^2 s/n$. The result implies
asymptotically vanishing prediction error if $s \log(p)^2/n \to 0$ for $n \to \infty$.



4 Numerical Results 



The results above imply that NNLS can be very effective if (a) the sign of regression coefficients is known
or can easily be estimated and (b) the Positive Eigenvalue Condition holds. Network tomography is a good
example (Castro et al. 2004). For others, including image analysis and applications in signal processing,
see Slawski et al. (2011). There are different aspects of network tomography, including origin-destination
matrix estimation and link-level network tomography; see Castro et al. (2004) for a good overview of the
statistical aspects and Xi et al. (2006) and Lawrence et al. (2006) for a discussion of active tomography in
the context of link-level analysis. We will focus on one aspect of the link-level network tomography. The
network consists of nodes arranged in a directed acyclic graph (or sometimes, as a special case, a tree) and
measurements can be taken at the leaf nodes. These measurements are used to infer the state of all the
nodes in the network. In a communication network, the measurements can be the delay or loss rate of
packets; in a transport network (such as a water distribution network) it can be the shortfall of the flow
rate compared to the expected rate. Since the network topology is assumed to be known, the measurements
typically consist of noise plus a linear combination of the internal and unobservable states of the nodes in
the network. If a node in the network has a loss (be it in the form of delayed packets or loss of water flow),
it will have a linear effect on all leaf nodes that are descendants of the node in the directed acyclic graph.

Figure 1 shows a toy example. Imagining a flow passing through the tree from the internal nodes to the leaf
nodes, the entry $X_{ij}$ is the proportion of flow in node $j$ that reaches leaf node $i$ if flow is divided equally among
all outgoing edges in each node of the tree. Three internal nodes have loss rates $(\beta_1, \beta_2, \beta_3) = (10, 10, 0)$.
The loss rates $Y = (Y_1, Y_2, Y_3)$ at the three leaf nodes are then given by $Y = X\beta + \varepsilon$ for some i.i.d. noise $\varepsilon$
and

$$X = \begin{pmatrix} 0.3 & 0.5 & 0 \\ 0.3 & 0 & 0.5 \\ 0.4 & 0.5 & 0.5 \end{pmatrix}.$$

A positivity constraint on the coefficient vector is clearly appropriate since there will in general not be a
negative loss at internal nodes (for example no unexpected gain of water in a distribution network). In the
noiseless case, the NNLS solution recovers exactly the internal states $(10, 10, 0)$ and thus identifies correctly
the first two nodes as responsible for the loss of the flow rate in all three leaf nodes. In this simple example,
the number of leaf nodes is equal to the number of internal nodes and ordinary least squares would also work
in the noiseless case. Least squares clearly ceases to be useful once the number of internal nodes exceeds
the number of leaf nodes. Note that, contrary to the previous literature (for example Castro et al. (2004);
Lawrence et al. (2006)), we do not attempt to fit a stochastic model to the observations. We are merely
trying to directly estimate the current internal state $\beta$ of the nodes in the network as accurately as possible.
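The noiseless toy example can be reproduced numerically. The sketch below assumes the design matrix of the toy network with zeros in the positions for which an internal node does not feed a leaf node, and the observations $(8, 3, 9)$ stated in the caption of Figure 1. The solver is a plain projected gradient iteration written for this illustration (function name and iteration count are assumptions), not one of the algorithms cited in the introduction.

```python
def nnls_projected_gradient(X, y, n_iter=5000):
    """Approximately minimise ||y - X b||_2^2 subject to b >= 0 by projected gradient."""
    n, p = len(X), len(X[0])
    # Gram matrix G = X^T X and vector h = X^T y
    G = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)] for a in range(p)]
    h = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    # The trace of G bounds its largest eigenvalue, so 1/trace is a safe step size
    step = 1.0 / sum(G[k][k] for k in range(p))
    beta = [0.0] * p
    for _ in range(n_iter):
        grad = [sum(G[a][b] * beta[b] for b in range(p)) - h[a] for a in range(p)]
        # Gradient step followed by projection onto the non-negative orthant
        beta = [max(0.0, beta[a] - step * grad[a]) for a in range(p)]
    return beta


# Rows are leaf nodes, columns are internal nodes (toy network of Figure 1)
X = [[0.3, 0.5, 0.0],
     [0.3, 0.0, 0.5],
     [0.4, 0.5, 0.5]]
Y = [8.0, 3.0, 9.0]          # noiseless loss rates observed at the leaf nodes

beta_hat = nnls_projected_gradient(X, Y)
rounded = [round(b, 6) for b in beta_hat]
```

Under these assumptions the estimate agrees with the internal states $(10, 10, 0)$ up to numerical precision, so the first two internal nodes are identified as the source of the losses.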

The theory suggests that a non-negativity constraint can already be very powerful under certain constraints
on the design matrix. The main condition is the Positive Eigenvalue Condition. In our simple
network tomography example, it is obvious that all entries in $X$ are non-negative and the same is hence true
for $\hat\Sigma = n^{-1} X^T X$. Entries in $X$ correspond to the amount of loss (delay of packets or reduction in flow
rate) in a leaf node caused by a specific loss at an internal node and are non-zero if and only if there is a
connection between the internal and the leaf node. Suppose that all non-zero entries in $X$ are at
least as large as $\delta$ for some $\delta > 0$. Suppose further that we can group all internal nodes into $B$ blocks such
that the internal nodes within a block share at least one leaf node to which they all connect. The Positive
Eigenvalue Condition is then fulfilled with value $\delta^2/B$; see Example III in the discussion of the condition.

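The theory indicates that under these conditions the NNLS-regularization is effective. To test this,
we examine the effect of placing an additional $\ell_1$-constraint on the coefficients by computing

$$\hat\beta^\lambda := \operatorname{argmin}_{\beta} \|Y - X\beta\|_2^2 \quad \text{such that} \quad \min_k \beta_k \ge 0 \text{ and } \|\beta\|_1 \le \lambda. \qquad (3)$$

Let $\hat\beta$ be again the NNLS-solution defined in (1). It is obvious that $\hat\beta^\lambda = \hat\beta$ for all $\lambda \ge \lambda_{\max}$ for $\lambda_{\max} := \|\hat\beta\|_1$.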
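The estimator with an additional $\ell_1$-constraint can be sketched with the same kind of projected gradient iteration, where the projection is now onto the set of non-negative vectors with sum at most $\lambda$. The function names, the fixed iteration count and the use of the toy network data are assumptions made for this illustration; the computations reported here may have used a different solver.

```python
def project_capped_simplex(v, lam):
    """Euclidean projection of v onto {b : b >= 0, sum(b) <= lam}."""
    if lam <= 0.0:
        return [0.0] * len(v)
    clipped = [max(0.0, x) for x in v]
    if sum(clipped) <= lam:
        return clipped           # the l1-constraint is inactive
    # Otherwise project onto {b >= 0, sum(b) = lam} via the sorting algorithm
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        candidate = (cumulative - lam) / j
        if u_j - candidate > 0.0:
            theta = candidate    # keep the shift belonging to the largest valid j
    return [max(0.0, x - theta) for x in v]


def nnls_l1_constrained(X, y, lam, n_iter=5000):
    """Approximately minimise ||y - X b||_2^2 subject to b >= 0 and sum(b) <= lam."""
    n, p = len(X), len(X[0])
    G = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)] for a in range(p)]
    h = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    step = 1.0 / sum(G[k][k] for k in range(p))   # 1/trace(G) is a safe step size
    beta = [0.0] * p
    for _ in range(n_iter):
        grad = [sum(G[a][b] * beta[b] for b in range(p)) - h[a] for a in range(p)]
        beta = project_capped_simplex([beta[a] - step * grad[a] for a in range(p)], lam)
    return beta


X = [[0.3, 0.5, 0.0], [0.3, 0.0, 0.5], [0.4, 0.5, 0.5]]
Y = [8.0, 3.0, 9.0]
beta_loose = nnls_l1_constrained(X, Y, lam=100.0)   # constraint inactive: NNLS solution
beta_tight = nnls_l1_constrained(X, Y, lam=5.0)     # constraint active: shrunken solution
```

For a bound far above the $\ell_1$-norm of the NNLS solution the two estimators coincide, while for a small bound the constraint is active and the coefficients sum to $\lambda$.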

Figure 2: The average number of correctly identified internal nodes with a positive loss under 1000 different
scenarios with an additional $\ell_1$-constraint as in (3), as a function of $\lambda$ (horizontal axis, up to $\lambda_{\max}$).
The NNLS solution corresponds to $\lambda = \lambda_{\max}$ and is seen to be in general superior to the solutions under
additional shrinkage.

We generate networks of similar type as the ones shown in Figure 1. The number $N$ of total nodes
is chosen for each of 1000 simulations uniformly from the set $\{25, 50, 100, 200, 400\}$. Nodes are distributed
uniformly on the area $[-1, 1]^2$ and numbered in order of their Euclidean distance from the origin. Starting
with the first node $k = 1$ closest to the origin, edges are drawn between it and its $K$ nearest neighbours
with a larger ordering number (where $K$ is drawn uniformly from the set $\{5, 10, 20\}$). When drawing edges
at node $k = 1, \ldots, N - 1$, they are deleted with probability $v$ (where $v$ is drawn uniformly from the set
$\{.2, .4, .6, .8, 1\}$) or when the edge would cross a previously drawn edge. Imagining again a flow passing
through the tree from the internal nodes to the leaf nodes, the entry $X_{ij}$ is the proportion of flow in node
$j$ that reaches leaf node $i$ if flow is divided equally among all outgoing edges in each node of the tree. For
each of the 1000 simulations, we draw a single graph from the parameters as described above and also draw
the noise variance uniformly from the set $\{0, 0.125, 0.25, 0.5, 1, 2, 4\}$ and a number $s$ of non-zero entries in
$\beta$ (corresponding to nodes with a delay or loss), where $s$ is drawn uniformly from the set $\{2, 5, 10\}$. The
$s$ non-zero entries of $\beta$ are generated independently as the absolute value of a standard-normal random
variable. For each such setting, we simulate 50 times the vector $Y$ and reconstruct with $\hat\beta^\lambda$ as in (3) for
an evenly spaced grid of 20 points between $\lambda = 0$ and $\lambda = \lambda_{\max}$, the NNLS solution. Nodes are put in
decreasing order of the reconstructed value $\hat\beta^\lambda$. We record the first entry in the re-ordered vector $\hat\beta^\lambda$ that
corresponds to a false positive (a zero entry in the equally re-ordered vector $\beta$) and call the number of true
positives the number of values of $\hat\beta^\lambda$ with larger value than the first false positive.
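The evaluation criterion described above can be written as a small function. The name `count_true_positives` is hypothetical; the sketch counts the truly non-zero entries whose estimate is strictly larger than the largest estimate among the truly zero entries, which is the first false positive in the ordering by decreasing estimated value.

```python
def count_true_positives(beta_hat, beta_true):
    """Number of true non-zero entries ranked strictly above the first false positive."""
    # Estimates at positions where the true coefficient is zero
    false_values = [b for b, t in zip(beta_hat, beta_true) if t == 0]
    # Value of the first false positive in decreasing order (none: everything counts)
    threshold = max(false_values) if false_values else float("-inf")
    return sum(1 for b, t in zip(beta_hat, beta_true) if t != 0 and b > threshold)


# Two true nodes are ranked above every zero node
tp_clean = count_true_positives([0.9, 0.1, 0.5, 0.3], [1.0, 0.0, 2.0, 0.0])
# A zero node (0.4) is ranked above one of the true nodes (0.2)
tp_mixed = count_true_positives([0.2, 0.4, 0.9], [1.0, 0.0, 1.0])
print(tp_clean, tp_mixed)  # -> 2 1
```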

Figure 2 shows the average number of true positives as a function of $\lambda$. Each line corresponds to the
average value over all 50 simulations in a given scenario. For nearly all scenarios there is no benefit in placing
an additional $\ell_1$-penalty on the coefficients. The NNLS solution is thus a very good and simple estimator in
these settings, as expected from theory. Additional regularization by an $\ell_1$-penalty does not seem to improve
results.

5 Discussion 

We have shown that non-negative (or sign-constrained) least squares can be an effective regularization
technique for sparse high-dimensional data under two conditions: (a) the data fulfil the so-called Positive
Eigenvalue Condition, which is easy to check for a given dataset, and (b) the sign of the coefficients is known
or can easily be estimated. If the conditions hold, NNLS can recover the correct sparsity pattern in the
absence of any further shrinkage, as long as $\log(p)\, s/n \to 0$ for $n \to \infty$, where $p$ is the number of variables,
$s$ the number of non-zero variables in the optimal regression vector and $n$ is sample size. We have shown
network tomography as an example where the sign of regression coefficients is known a priori and the design
condition is fulfilled automatically, at least approximately. In other examples the sign can be estimated by
an initial estimator. An attractive feature of NNLS is that it does not require any tuning parameter beyond
the choice of the signs of the individual regression coefficients. Despite its simplicity, it can be remarkably
accurate for high-dimensional regression.



6 Appendix: Proofs 

6.1 Proof of Theorem 1

First, for any $C > 0$, $1 - \Phi(C) \le (2\pi)^{-1/2} C^{-1} \exp(-C^2/2)$. Choosing $C^2 = K_{p,\eta}^2 = 2\log(p/\eta)$, it follows
with $\eta < 1/3$ and hence $C > 1$ that $1 - \Phi(C) \le \eta/(2p)$. Thus $1 - (p+s)(1 - \Phi(C)) \ge 1 - 2p(1 - \Phi(C)) \ge 1 - \eta$
and the result hence follows from Lemma 1.
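The tail bound used in this step can be checked numerically. The sketch below assumes that the constant is chosen as $C^2 = 2\log(p/\eta)$ (an assumption of this illustration) and verifies, for a few values of $p$ and $\eta < 1/3$, both the Gaussian tail inequality and the resulting bound $\eta/(2p)$. The function names are hypothetical.

```python
import math


def gaussian_upper_tail(c):
    """1 - Phi(c) for a standard normal random variable."""
    return 0.5 * math.erfc(c / math.sqrt(2.0))


def check_tail_bound(p, eta):
    """Check 1 - Phi(C) <= (2 pi)^(-1/2) exp(-C^2/2) / C <= eta / (2 p)."""
    c = math.sqrt(2.0 * math.log(p / eta))       # assumed choice of the constant C
    mills_bound = math.exp(-c * c / 2.0) / (c * math.sqrt(2.0 * math.pi))
    tail = gaussian_upper_tail(c)
    return tail <= mills_bound and mills_bound <= eta / (2.0 * p)


results = {(p, eta): check_tail_bound(p, eta)
           for p in (1, 10, 1000, 10 ** 6)
           for eta in (0.01, 0.1, 0.3)}
```

Under this choice $\exp(-C^2/2) = \eta/p$, so the second inequality only needs $(2\pi)^{-1/2} C^{-1} \le 1/2$, which holds whenever $C > 1$.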

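6.2 Proof of Theorem 2

Define the oracle non-negative least squares solution as

$$\hat\beta^{oracle} := \operatorname{argmin}_{\beta} \|Y - X\beta\|_2^2 \quad \text{such that} \quad \min_k \beta_k \ge 0 \text{ and } \beta_N = 0, \qquad (4)$$

and let $\delta\beta = \hat\beta - \hat\beta^{oracle}$.

Let $M$ be the set $M := \{k : \delta\beta_k < 0\}$. Using Equation (9) in the proof of Lemma 1 it follows that,
with probability at least $1 - (p+s)(1 - \Phi(C))$,

$$\delta\beta^T \hat\Sigma\, \delta\beta \le 2 C \sigma \|\delta\beta_{M^c}\|_1 / \sqrt{n}$$

and, using $\|\delta\beta_{M^c}\|_1 \le \|\delta\beta\|_1$ and the bound in (7) for the latter quantity, it holds with probability at least
$1 - (p+s)(1 - \Phi(C))$ that

$$\delta\beta^T \hat\Sigma\, \delta\beta \le 2 C^2 \sigma^2 (5\nu^{-1} + 2\phi^{-1/2}) \frac{s}{n}.$$

Since $\|X \delta\beta\|_2^2 = n\, \delta\beta^T \hat\Sigma\, \delta\beta$ and using again $C^2 = K_{p,\eta}^2 = 2\log(p/\eta)$, the claim follows.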
6.3 Lemmata 

Lemma 1. Assume that the Positive Eigenvalue Condition holds with $\nu > 0$. Choose any $C > 0$. Assume
that the compatibility condition holds with $\phi > 0$ for $L = 4\nu^{-1}$ and $\min_{k \in S} \beta^*_k > C\sigma/\sqrt{n\phi}$. It then holds
with probability at least $1 - (p+s)(1 - \Phi(C))$ that

$$\|\hat\beta - \beta^*\|_1 \le C \sigma \Big( \frac{5}{\nu} + \frac{4}{\sqrt{\phi}} \Big) \frac{s}{\sqrt{n}}.$$



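Proof. By the definition (1) of $\hat\beta$ and definition (4) of $\hat\beta^{oracle}$,

$$\delta\beta = \operatorname{argmin}_{\gamma} \|Y - X\hat\beta^{oracle} - X\gamma\|_2^2 \quad \text{such that} \quad \gamma_k \ge -\hat\beta^{oracle}_k \text{ for all } k = 1, \ldots, p. \qquad (5)$$

The bound for $\|\hat\beta - \beta^*\|_1$ follows as $\|\hat\beta - \beta^*\|_1 \le \|\hat\beta^{oracle} - \beta^*\|_1 + \|\delta\beta\|_1$. Using Lemma 2, it holds with
probability exceeding $1 - (p+s)(1 - \Phi(C))$,

$$\|\hat\beta^{oracle} - \beta^*\|_1 \le 2 C \sigma \phi^{-1/2} \frac{s}{\sqrt{n}}, \qquad (6)$$

and it thus remains to be shown that, if (6) is fulfilled, also

$$\|\delta\beta\|_1 \le C \sigma (5\nu^{-1} + 2\phi^{-1/2}) \frac{s}{\sqrt{n}}. \qquad (7)$$

Let $R = Y - X\hat\beta^{oracle}$. Since $\delta\beta = 0$ is a feasible solution in (5), we have that

$$\delta\beta^T X^T X\, \delta\beta - 2 R^T X\, \delta\beta \le 0.$$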

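Let

$$M := \{k : \delta\beta_k < 0\}. \qquad (8)$$

By the definition of the estimator, $M \subseteq S$ and $N \subseteq M^c$. By Lemma 3, with probability at least $1 - p(1 - \Phi(C))$,

$$\max_{k \in N} R^T X_k \le C \sigma \sqrt{n}.$$

By Lemma 2, with probability at least $1 - s(1 - \Phi(C))$, $R^T X_k = 0$ for all $k \in S$. Hence, taken together,
with probability at least $1 - (p+s)(1 - \Phi(C))$,

$$R^T X\, \delta\beta \le \Big( \max_{k \in N} R^T X_k \Big) \|\delta\beta_{M^c}\|_1 \le C \sigma \sqrt{n}\, \|\delta\beta_{M^c}\|_1$$

and thus

$$\delta\beta^T \hat\Sigma\, \delta\beta \le 2 C \sigma \|\delta\beta_{M^c}\|_1 / \sqrt{n}. \qquad (9)$$

Now,

$$\delta\beta^T \hat\Sigma\, \delta\beta \ge \delta\beta_{M^c}^T \hat\Sigma\, \delta\beta_{M^c} + 2 \sum_{i \in M,\, j \in M^c} \delta\beta_i \hat\Sigma_{ij} \delta\beta_j$$

$$\ge \delta\beta_{M^c}^T \hat\Sigma\, \delta\beta_{M^c} - 2 \|\delta\beta_M\|_1 \|\delta\beta_{M^c}\|_1$$

$$\ge \nu \|\delta\beta_{M^c}\|_1^2 - 2 \|\delta\beta_M\|_1 \|\delta\beta_{M^c}\|_1, \qquad (10)$$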

having used the normalization to 1 of all columns of $X$ (which bounds the absolute values of all entries in $\hat\Sigma$
by 1) for the second term in the second last inequality and the Positive Eigenvalue Condition for the first
term in the last inequality (together with the fact that $\min_{k \in M^c} \delta\beta_k \ge 0$ by definition of $M$ in (8)). Using
this bound in (9) and dividing by $\|\delta\beta_{M^c}\|_1$ yields that, with probability at least $1 - (p+s)(1 - \Phi(C))$,

$$\|\delta\beta_{M^c}\|_1 \le 2\nu^{-1} \big( C\sigma/\sqrt{n} + \|\delta\beta_M\|_1 \big) = 2\nu^{-1} \Big( \frac{C\sigma/\sqrt{n}}{\|\delta\beta_M\|_1} + 1 \Big) \|\delta\beta_M\|_1. \qquad (11)$$

Evidently $\|\delta\beta_M\|_1 \le C\sigma/\sqrt{n}$ is either true or not. If it is true, then it follows trivially from the first bound
in (11) that $\|\delta\beta_{M^c}\|_1 \le 4\nu^{-1} C\sigma/\sqrt{n}$ and hence $\|\delta\beta\|_1 \le (1 + 4\nu^{-1}) C\sigma/\sqrt{n} \le 5\nu^{-1} C\sigma/\sqrt{n}$, and the bound
in (7) holds true.

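Alternatively, if $\|\delta\beta_M\|_1 > C\sigma/\sqrt{n}$ we have from the second expression in (11) that $\|\delta\beta_{M^c}\|_1 \le L \|\delta\beta_M\|_1$
for $L = 4\nu^{-1}$ and thus, using $N \subseteq M^c$ and $M \subseteq S$, also $\|\delta\beta_N\|_1 \le L \|\delta\beta_S\|_1$. The vector $\delta\beta$ is then in $\mathcal{R}(L, S)$. Using
the compatibility condition, it follows that

$$\delta\beta^T \hat\Sigma\, \delta\beta \ge \frac{\phi}{s} \|\delta\beta\|_1^2.$$

Using this in (9),

$$\frac{\phi}{s} \|\delta\beta\|_1^2 \le 2 C \sigma \|\delta\beta_{M^c}\|_1 / \sqrt{n}. \qquad (12)$$

Using $\|\delta\beta_{M^c}\|_1 \le \|\delta\beta\|_1$, it follows that

$$\|\delta\beta\|_1 \le 2 C \sigma s / (\sqrt{n}\, \phi), \qquad (13)$$

which also satisfies the bound in (7). Hence, the bound (7) holds under both possible scenarios ($\|\delta\beta_M\|_1 \le
C\sigma/\sqrt{n}$ true or false) and the proof is complete.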
□

Lemma 2. Let $\hat\beta^{ols}$ be the least squares estimator restricted to $S$:

$$\hat\beta^{ols} = \operatorname{argmin}_{\beta} \|Y - X\beta\|_2^2 \quad \text{such that} \quad \beta_N = 0.$$

If $\min_{k \in S} \beta^*_k > C\sigma/\sqrt{n\phi}$, then

$$P(\hat\beta^{ols} = \hat\beta^{oracle}) \ge 1 - s(1 - \Phi(C)),$$

and, with at least the same probability $1 - s(1 - \Phi(C))$,

$$\|\beta^* - \hat\beta^{oracle}\|_\infty \le C\sigma/\sqrt{n\phi}.$$

Proof. It is only necessary to show that $\min_{k \in S} \hat\beta^{ols}_k > 0$ with probability at least $1 - s(1 - \Phi(C))$.

The error term has, under the made assumptions, a normal distribution, $\hat\beta^{ols}_k - \beta^*_k \sim \mathcal{N}(0, \sigma^2 ((n\hat\Sigma_{SS})^{-1})_{kk})$
for all $k \in S$. The minimal eigenvalue of $\hat\Sigma_{SS}$ is bounded from below by $\phi$ by the compatibility condition
and the variance of $\hat\beta^{ols}_k$ is thus bounded from above by $\phi^{-1}\sigma^2/n$ for all $k \in S$. It follows with Bonferroni's
inequality that, with probability at least $1 - s(1 - \Phi(C))$,

$$\|\beta^* - \hat\beta^{ols}\|_\infty \le C\sigma/\sqrt{n\phi}. \qquad (14)$$

If $\min_{k \in S} \beta^*_k > C\sigma/\sqrt{n\phi}$, then (14) implies that $\min_{k \in S} \hat\beta^{ols}_k > 0$ and thus $\hat\beta^{oracle} = \hat\beta^{ols}$ and thus also

$$\|\beta^* - \hat\beta^{oracle}\|_\infty \le C\sigma/\sqrt{n\phi},$$

which completes the proof. □
Lemma 3. With probability at least $1 - p(1 - \Phi(C))$,

$$\max_{k \in N} (Y - X\hat\beta^{oracle})^T X_k \le C \sigma \sqrt{n}.$$

Proof. We condition on the event $\hat\beta^{oracle} = \hat\beta^{ols}$, which happens according to Lemma 2 with probability
at least $1 - s(1 - \Phi(C))$. Then $Y - X\hat\beta^{oracle} = Y - X\hat\beta^{ols} = P_{S\perp} Y$, where $P_{S\perp} Z$ is the projection
of a vector $Z \in \mathbb{R}^n$ into the space orthogonal to $X_S$. Now, $P_{S\perp} Y = P_{S\perp}(X\beta^* + \varepsilon) = P_{S\perp}\varepsilon$. The
distribution of $(P_{S\perp}\varepsilon)^T X_k$ is, for every $k \in N$, normal with mean 0 and variance at most $\sigma^2 n$, and thus
$P((P_{S\perp}\varepsilon)^T X_k > C\sigma\sqrt{n}) \le 1 - \Phi(C)$ for all $k \in N$ and, using a Bonferroni bound,
$P(\max_{k \in N} (P_{S\perp}\varepsilon)^T X_k > C\sigma\sqrt{n}) \le |N|(1 - \Phi(C))$. The unconditional probability of
$\max_{k \in N} (P_{S\perp}\varepsilon)^T X_k \le C\sigma\sqrt{n}$ is thus at least
$1 - s(1 - \Phi(C)) - |N|(1 - \Phi(C)) = 1 - (s + |N|)(1 - \Phi(C)) = 1 - p(1 - \Phi(C))$, which completes the proof.

□